Lab 07b: Decision tree regression

Introduction

This lab focuses on data modelling using decision tree and random forest regression. It's a direct counterpart to the linear regression modelling in Lab 06. At the end of the lab, you should be able to use scikit-learn to:

  • Create a decision tree regression model and a random forest regression model.
  • Use the models to predict new values.
  • Measure the accuracy of the models.

Getting started

Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use the tree subpackage from scikit-learn to create decision tree models and the ensemble subpackage to create random forest models.


In [ ]:
%matplotlib inline
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The dataset is in fixed width format, but fortunately this is supported out of the box by pandas' read_fwf function:


In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'

df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                                          'acceleration', 'model year', 'origin', 'car name'])

Exploratory data analysis

According to its documentation, the Auto MPG dataset consists of eight explanatory variables (i.e. features), each describing an attribute of a given car model, and a target variable: the car's fuel consumption in miles per gallon (MPG). The following attribute information is given:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

Let's start by taking a quick peek at the data:


In [ ]:
df.head()

As the car name is unique to each instance (according to the dataset documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:

Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names, but for simplicity let's just drop them here. (A rough sketch of this idea is shown after the next cell.)


In [ ]:
df = df.set_index('car name')

df.head()
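
As an aside, here's a minimal sketch of the manufacturer idea mentioned in the note above. It assumes the manufacturer is simply the first word of each car name, which is only roughly true for this dataset (the raw names contain spelling variants), so it would need cleaning before real use:


In [ ]:
# Optional sketch: derive a rough 'manufacturer' feature by taking the first
# word of each car name (now stored in the index). The raw names contain
# spelling variants, so this would need cleaning before being used as a feature.
manufacturer = df.index.to_series().str.split().str[0]
manufacturer.value_counts().head()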

According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. Again, for simplicity, let's just drop these from the data set:


In [ ]:
df = df[df['horsepower'] != '?']
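
If you're curious how many rows this affects, an optional sanity check is to look at the shape of the data frame after the drop (the documentation notes that only a handful of rows have missing horsepower values):


In [ ]:
# Optional sanity check: number of (rows, columns) remaining after dropping the
# rows where horsepower was recorded as '?'
df.shape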

Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because strings were present initially, the data type of the horsepower column isn't numeric:


In [ ]:
df.dtypes

We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:


In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])

# Check the data types again
df.dtypes

As can be seen, the data type of the horsepower column is now float64, i.e. a 64 bit floating point value.

According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2), so we should encode it via one-hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is call the get_dummies function, as follows:


In [ ]:
df = pd.get_dummies(df, columns=['origin'])

df.head()

As can be seen, one-hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a decision tree regression model, we don't need to worry about the effects of multicollinearity, and so there's no need to drop one of the encoded variable columns as we did in the case of linear regression.
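
For reference, if we were fitting a linear regression model instead, get_dummies has a drop_first option that drops one of the encoded columns for us. The snippet below is just a sketch of that alternative, demonstrated on a small toy data frame so that we don't disturb df, which has already been encoded above:


In [ ]:
# Sketch only: how the encoding would look with one dummy column dropped, as we
# would do for a linear regression model to avoid perfect multicollinearity
toy = pd.DataFrame({'origin': [1, 2, 3, 1]})
pd.get_dummies(toy, columns=['origin'], drop_first=True)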

Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:


In [ ]:
df.describe()

Next, let's print a matrix of pairwise Pearson correlation values:


In [ ]:
df.corr()

Let's also create a scatter plot matrix:


In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));

Based on the above information, we can conclude the following:

  • Based on a quick visual inspection, there don't appear to be significant numbers of outliers in the data set. (We could make boxplots for each variable to check more carefully; a quick optional sketch is shown below.)
  • Most of the explanatory variables appear to have a non-linear relationship with the target.
  • There is a high degree of correlation ($r > 0.9$) between cylinders and displacement, and also between weight and displacement.
  • The following variables appear to be right-skewed (i.e. they have long right tails): mpg, displacement, horsepower, weight.
  • The acceleration variable appears to be approximately normally distributed.
  • The model year appears to be roughly uniformly distributed.
  • The cylinders and origin variables have few unique values.

For now, we'll just note this information, but we'll come back to it later when improving our model.
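
Here's the optional boxplot sketch mentioned above. Plotting each of the original numeric variables on its own axis stops the differences in scale (e.g. weight versus acceleration) from hiding the detail:


In [ ]:
# Optional sketch: one boxplot per variable, each on its own axis so that
# differences in scale between the variables don't hide the detail
df[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'model year']].plot(
    kind='box', subplots=True, layout=(2, 4), figsize=(16, 8),
    sharex=False, sharey=False);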

Data Modelling

Decision tree regression

Let's build a decision tree regression model to predict the MPG of a car based on its other attributes. scikit-learn supports decision tree functionality via the tree subpackage. This subpackage supports both decision tree regression and classification. We can use the DecisionTreeRegressor class to build our model.

DecisionTreeRegressor accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:


In [ ]:
DecisionTreeRegressor().get_params()

You can find a more detailed description of each parameter in the scikit-learn documentation.
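
You can also pull up the same information from inside the notebook itself, for example:


In [ ]:
# View the estimator's docstring from within the notebook (in Jupyter/IPython,
# `DecisionTreeRegressor?` does much the same thing)
help(DecisionTreeRegressor)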

Let's use a grid search to select the optimal decision tree regression model from a set of candidates. First, we define the parameter grid. Then, we use a grid search with an inner cross validation to select the best model, and an outer cross validation to measure the accuracy of the selected model.


In [ ]:
X = df.drop('mpg', axis='columns')  # X = features
y = df['mpg']                       # y = prediction target

algorithm = DecisionTreeRegressor(random_state=0)

# Build models for different values of min_samples_leaf and min_samples_split
parameters = {
    'min_samples_leaf': [1, 10, 20],
    'min_samples_split': [2, 10, 20]  # Min value is 2
}

# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

# Print the results 
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the decision tree regression model',
    xlabel='Error'
);

Our decision tree regression model predicts the MPG with a mean absolute error of approximately 2.32 and an error standard deviation of approximately 3.16, which is similar to our final linear regression model from Lab 06. It's also worth noting that we were able to achieve this level of accuracy with very little feature engineering effort. This is because decision tree regression does not rely on the same set of assumptions (e.g. linearity) as linear regression, and so is able to learn from data with less manual tuning.

We can check the parameters that led to the best model via the best_params_ attribute of the output of our grid search, as follows:


In [ ]:
clf.best_params_
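
If you're curious about the selected model itself (and not just its hyperparameters), the refitted estimator is available via the best_estimator_ attribute. As an optional extra, one way to see which features the chosen tree relies on most is to inspect its feature_importances_ attribute:


In [ ]:
# Optional: inspect the refitted best model and rank the features by how much
# they contribute to its splits (best_estimator_ is available because
# GridSearchCV refits on the full data set by default)
best_tree = clf.best_estimator_
pd.Series(best_tree.feature_importances_, index=X.columns).sort_values(ascending=False)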

Random forest regression

Next, let's build a random forest regression model to predict the car MPGs to see if we can improve on our decision tree model. Random forests are ensemble models, i.e. they are a collection of different decision trees, each of which is trained on a random subset of the data. By combining trees with different characteristics, it's possible to form an overall model that can utilise the benefits of each, which often produces better results than using a single tree to model all the data. scikit-learn supports ensemble model functionality via the ensemble subpackage. This subpackage supports both random forest regression and classification. We can use the RandomForestRegressor class to build our model.

RandomForestRegressor accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:


In [ ]:
RandomForestRegressor().get_params()

As before, you can find a more detailed description of each parameter in the scikit-learn documentation.

Let's use a grid search to select the optimal random forest regression model from a set of candidates. As before, we define the parameter grid, use a grid search with an inner cross validation to select the best model, and use an outer cross validation to measure the accuracy of the selected model.


In [ ]:
X = df.drop('mpg', axis='columns')  # X = features
y = df['mpg']                       # y = prediction target

algorithm = RandomForestRegressor(random_state=0)

# Build models for different values of n_estimators, min_samples_leaf and min_samples_split
parameters = {
    'n_estimators': [2, 5, 10],
    'min_samples_leaf': [1, 10, 20],
    'min_samples_split': [2, 10, 20]  # Min value is 2
}

# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5

clf = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1)  # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)

# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)

# Print the results 
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())

ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the random forest regression model',
    xlabel='Error'
);

As can be seen, our random forest regression model significantly outperforms our previous decision tree model, as well as our linear regression model from Lab 06. Further improvements might be made by expanding the ranges of values in the parameter grid or by tuning further hyperparameters (e.g. impurity measures or stopping criteria).
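
As a starting point for your own experiments, here's one possible expanded grid. The particular values below are just suggestions, not part of the lab solution, and the split criterion could also be varied (its allowed values differ between scikit-learn versions, so check the documentation for your installed version). Note that a larger grid takes longer to search, as the number of candidate models is the product of the list lengths:


In [ ]:
# One possible expanded grid (suggested values only): more trees, plus an extra
# stopping criterion (max_depth) and a per-split feature limit (max_features).
# This grid has 4 x 3 x 3 x 3 x 3 = 324 candidates, so it takes a while to run.
parameters = {
    'n_estimators': [10, 50, 100, 200],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 5, 10],
    'max_depth': [None, 5, 10],
    'max_features': [None, 'sqrt', 1.0 / 3]
}

clf = GridSearchCV(RandomForestRegressor(random_state=0), parameters,
                   cv=inner_cv, n_jobs=-1)
clf.fit(X, y)

clf.best_params_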